📊 LLM Evals - tyler · Scour

Jankmarking: Janky Benchmarking ⚙️Performance Profiling

williamangel.net·3d·Hacker News

Why does AI memory fail at connecting facts? I ran the benchmarks to find out ⚙️Systems Programming

yourmemoryai.xyz·2d·Hacker News, r/SideProject

Mapping AI benchmarks onto a common capability scale ⌚Quantified Self

aiiq.org·1h·Hacker News

Open Source Robot Policies, Datasets, and Benchmarks 🤖Game AI

festivus.hapticlabs.ai·39m·Hacker News

Recursive Multi-Agent Systems 🤖Agent-Based Simulations

recursivemas.github.io·1d·Hacker News

ZAYA1-8B: An 8B MoE Model with 760M Active Params Matching DeepSeek-R1 on Math 🕸️WASM

firethering.com·5d·Hacker News

Interfaze: A new model architecture built for high accuracy at scale 💬Prompt Engineering

interfaze.ai·1d·Hacker News

Lies, damned lies, and Elastic's benchmarks ⚙️Systems Programming

gouthamve.dev·2d·Hacker News

ProgramBench: Can Language Models Rebuild Programs From Scratch? 🔧Code Generation

arxiv.org·6d·Hacker News

MySQL hypergraph optimizer 📈Query Optimization

blog.sesse.net·2d

Foundation Model Engineering: From theory to production 💬Prompt Engineering

sungeuns.github.io·4d·Hacker News

BintzGavin/apastra: Lightweight prompt versioning, evals, benchmarks, and delivery 💬Prompt Engineering

github.com·4d·Hacker News

We ran OWASP attacks on 8 LLMs. Optimized small models beat frontier defaults 🔐Cybersecurity

megacode.ai·5d·Hacker News

SubQ: A New LLM with a 12M Token Context That Rivals Claude and ChatGPT 💬Prompt Engineering

felloai.com·6d·Hacker News

Benchmarking AI agent retrieval strategies on Kubernetes bug fixes 📊Performance Monitoring

cncf.io·4d·Hacker News

We Ran 250 AI Agent Evals to Find Out if Skills Beat Docs. The Answer Is More Complicated Than We Expected 🤖Creative Automation

wix.engineering·6d·Hacker News

Show HN: Vibe code your agents without vibe coding your agent 💬Prompt Engineering

deepeval.com·3d·Hacker News

Optimize for change not application performance ⚙️Performance Profiling

echooff.dev·3d·Hacker News

hpke-ng: Faster, Smaller, Harder HPKE for Rust ⚙️Systems Programming

symbolic.software·4d·Lobsters, r/rust

e-Bike Fleet Monitoring 📊Running Analytics

tech.marksblogg.com·6d·Hacker News
